The importance of precise tokenizing for deep grammars

Authors

  • Martin Forst
  • Ronald M. Kaplan
Abstract

We present a non-deterministic finite-state transducer that acts as a tokenizer and normalizer for free text that is input to a broad-coverage LFG of German. We compare the basic tokenizer used in an earlier version of the grammar and the more sophisticated tokenizer that we now use. The revised tokenizer increases the coverage of the grammar in terms of full parses from 68.3% to 73.4% on sentences 8,001 through 10,000 of the TiGer Corpus.

Similar resources

Towards Domain-Independent Deep Linguistic Processing: Ensuring Portability and Re-Usability of Lexicalised Grammars

In this paper we illustrate and underline the importance of making detailed linguistic information a central part of the process of automatic acquisition of large-scale lexicons as a means for enhancing robustness and at the same time ensuring maintainability and re-usability of deep lexicalised grammars. Using the error mining techniques proposed in (van Noord, 2004) we show very convincingly ...

A Multi-layered Acoustic Tokenizing Deep Neural Network (MAT-DNN) for Unsupervised Discovery of Linguistic Units and Generation of High Quality Features

This paper summarizes the work done by the authors for the Zero Resource Speech Challenge organized in the technical program of Interspeech 2015. The goal of the challenge is to discover linguistic units directly from unlabeled speech data. The Multi-layered Acoustic Tokenizer (MAT) proposed in this work automatically discovers multiple sets of acoustic tokens from the given corpus. Each acoust...

Integrating Finite-state Technology with Deep LFG Grammars

Researchers at PARC were pioneers in developing finite-state methods for applications in computational linguistics, and one of the original motivations was to provide a coherent architecture for the integration of lower-level lexical processing with higher-level syntactic analysis (Kaplan and Kay, 1981; Karttunen et al., 1992; Kaplan and Kay, 1994). Finite-state methods for tokenizing and morph...

Alternating Regular Tree Grammars in the Framework of Lattice-Valued Logic

In this paper, two different ways of introducing alternation for lattice-valued (referred to as L-valued) regular tree grammars and L-valued top-down tree automata are compared. One way defines the alternating regular tree grammar, i.e., alternation is governed by the non-terminals of the grammar; the other combines state with alternation. The first way is ta...

On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis

In this paper we investigate the impact of simple text preprocessing decisions (particularly tokenizing, lemmatizing, lowercasing and multiword grouping) on the performance of a state-of-the-art text classifier based on convolutional neural networks. Despite potentially affecting the final performance of any given model, this aspect has not received a substantial interest in the deep learning l...

Publication date: 2006